Impact of Image Quality on Machine Print Optical Character Recognition
Abstract
The National Institute of Standards and Technology (NIST) is in the process of setting up a new series of conferences named the Metadata Text Retrieval Conferences (METTREC). They will focus on evaluating two critical technologies: document conversion using optical character recognition (OCR) and information retrieval (IR). Large collections of document images labeled with correct recognition and retrieval responses are needed to measure performance. Currently, the production of these materials is extremely expensive. NIST is developing a semi-automated truthing tool that will help reduce the cost of data preparation and enable evaluations to scale up. To accomplish this, current OCR technology is needed to produce an initial text-to-image alignment. This paper describes a small experiment in which three different vendor products (two Windows NT/95-based and one UNIX-based) are evaluated across three sets of document images of progressively decreasing print and image quality. The evaluation images are subjectively selected pages from the 1994 Federal Register. Results demonstrate the impact of degrading print and image quality, with reported character recognition error rates ranging from 1% to as high as 74%.
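The abstract reports character recognition error rates but does not say how they were computed. As a rough illustration only, the sketch below assumes one common definition: the Levenshtein edit distance between the ground-truth text and the OCR output, divided by the length of the ground truth. The function names `edit_distance` and `character_error_rate` are hypothetical and are not taken from the paper or from any NIST tool.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # turning i reference characters into "" takes i deletions
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete ref_char
                curr[j - 1] + 1,     # insert hyp_char
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by ground-truth length (an assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this assumed definition, the rate can exceed 100% when the OCR output contains many spurious insertions, so figures computed this way are not necessarily comparable with those reported in the paper.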
Related resources
Machine Print Database
This report describes the NIST Machine Print Database, NIST Special Database 8 (SD8), which contains 360 8-bit gray scale images of pages containing machine printed characters, and a corresponding binary version of each page, resulting in a total of 720 digitized pages. This database is being distributed as a common set of images for use in the development and testing of Optical Character Recog...
Improve The Character Detection System Based On Feature Extraction Algorithm
The character recognition is the major important part in the area of document analysis. Character Recognition could be evaluated on printed text and handwritten text. Printed texture could be from a good quality image. In this research work, we implemented in the OCR approach to improve the recognition of character with Classification approach. We work on filtration techniques to improve the pi...
OCR - Optical Character Recognition
Character recognition techniques associate a symbolic identity with the image of character. Character recognition is commonly referred to as optical character recognition (OCR), as it deals with the recognition of optically processed characters. The modern version of OCR appeared in the middle of the 1940's with the development of the digital computers. OCR machines have been commercially avail...
Offline Handwritten English Script Recognition: A Survey
OFFLINE handwriting recognition is the task of determining what letters or words are present in a digital image of handwritten text. It is of significant benefit to man-machine communication and can assist in the automatic processing of handwritten documents. It is a subtask of Optical Character Recognition (OCR), whose domain can be machine-print or handwriting but is more commonly machine-pri...
Blob Detection Technique Using Image Processing for Identification of Machine Printed Characters
Optical character recognition systems have been effectively developed for the recognition of printed characters. Optical character recognition is an awesome computer vision technique with various applications ranging from saving real time scripts digitally and deriving context based intelligence using natural language processing from the texts. One such application is the recognition of machine...
Publication date: 1997